Problem 1

Report the dimension of election.raw.

dim(election.raw)
## [1] 31167     5
election.raw
## # A tibble: 31,167 x 5
##    state    county     candidate     party  votes
##    <chr>    <chr>      <fct>         <fct>  <dbl>
##  1 Delaware Kent       Joe Biden     DEM    44518
##  2 Delaware Kent       Donald Trump  REP    40976
##  3 Delaware Kent       Jo Jorgensen  LIB     1044
##  4 Delaware Kent       Howie Hawkins GRN      420
##  5 Delaware Kent       Write-ins     WRI        0
##  6 Delaware New Castle Joe Biden     DEM   194238
##  7 Delaware New Castle Donald Trump  REP    87685
##  8 Delaware New Castle Jo Jorgensen  LIB     2932
##  9 Delaware New Castle Howie Hawkins GRN     1278
## 10 Delaware New Castle Write-ins     WRI        0
## # … with 31,157 more rows

In the election.raw dataset, there are 31,167 rows and 5 columns.

Are there missing values in the data set?

sum(is.na(election.raw))
## [1] 0

No, there are no missing values in the data set.
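sum(is.na(.)) counts missing values over the whole frame; when it is non-zero, a per-column breakdown locates them. A minimal sketch on a toy frame (not the actual election.raw):

```r
# toy frame standing in for election.raw
tmp = data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
sum(is.na(tmp))       # 2 missing values overall
colSums(is.na(tmp))   # one NA in each column
```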

Compute the total number of distinct values in state in election.raw to verify that the data contains all states and a federal district.

unique(election.raw['state'])
## # A tibble: 51 x 1
##    state               
##    <chr>               
##  1 Delaware            
##  2 District of Columbia
##  3 Florida             
##  4 Georgia             
##  5 Hawaii              
##  6 Idaho               
##  7 Illinois            
##  8 Indiana             
##  9 Iowa                
## 10 Kansas              
## # … with 41 more rows
nrow(unique(election.raw['state']))
## [1] 51

There are 51 distinct values in the state column: the 50 states of the United States of America plus the District of Columbia.

Problem 2

Report the dimension of census.

dim(census)
## [1] 3220   37

In the census data set, there are 3220 rows and 37 columns.

Are there missing values in the data set?

sum(is.na(census))
## [1] 1

Yes, there is a missing value in the data set.

Compare the total number of distinct values in county in census with that in election.raw. Comment on your findings.

##unique county data
nrow(unique(census['County']))
## [1] 1955
unique(census['County'])
## # A tibble: 1,955 x 1
##    County         
##    <chr>          
##  1 Autauga County 
##  2 Baldwin County 
##  3 Barbour County 
##  4 Bibb County    
##  5 Blount County  
##  6 Bullock County 
##  7 Butler County  
##  8 Calhoun County 
##  9 Chambers County
## 10 Cherokee County
## # … with 1,945 more rows
##unique election.raw data
nrow(unique(election.raw['county']))
## [1] 2825
unique(election.raw['county'])
## # A tibble: 2,825 x 1
##    county              
##    <chr>               
##  1 Kent                
##  2 New Castle          
##  3 Sussex              
##  4 District of Columbia
##  5 Ward 2              
##  6 Ward 3              
##  7 Ward 4              
##  8 Ward 5              
##  9 Ward 6              
## 10 Ward 7              
## # … with 2,815 more rows

There are 1,955 unique county names in the census data set, compared with 2,825 in the election.raw data set. The counts differ for two reasons: election.raw includes reporting units that are not counties (for example, the District of Columbia's wards), and the census County column carries suffixes such as "County" and "Parish" that election.raw omits, so the two columns do not label counties the same way.
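One caveat on counts like these: distinct county names undercount actual counties, because the same name recurs across states. Counting distinct (state, county) pairs avoids that, as this toy sketch shows:

```r
# toy frame: the same county name appears in two states
dupes = data.frame(
  state = c("Alabama", "Georgia", "Georgia"),
  county = c("Washington", "Washington", "Fulton")
)
nrow(unique(dupes["county"]))               # 2 distinct names
nrow(unique(dupes[c("state", "county")]))   # 3 distinct (state, county) pairs
```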

Problem 3

Construct aggregated data sets from election.raw data.

election.state = election.raw %>% group_by(state, candidate) %>% summarise(votes = mean(votes))
election.state
## # A tibble: 351 x 3
## # Groups:   state [51]
##    state   candidate          votes
##    <chr>   <fct>              <dbl>
##  1 Alabama Donald Trump    21405.  
##  2 Alabama Jo Jorgensen      373.  
##  3 Alabama Joe Biden       12589.  
##  4 Alabama Write-ins         109.  
##  5 Alaska  Brock Pierce        7.42
##  6 Alaska  Don Blankenship     8.7 
##  7 Alaska  Donald Trump     2025.  
##  8 Alaska  Jesse Ventura      22.8 
##  9 Alaska  Jo Jorgensen       87.0 
## 10 Alaska  Joe Biden        1144.  
## # … with 341 more rows
election.total = election.raw %>% summarise(votes = sum(votes))
election.total
## # A tibble: 1 x 1
##       votes
##       <dbl>
## 1 150361237

Problem 4

How many named presidential candidates were there in the 2020 election?

unique(election.raw['candidate'])
## # A tibble: 38 x 1
##    candidate         
##    <fct>             
##  1 Joe Biden         
##  2 Donald Trump      
##  3 Jo Jorgensen      
##  4 Howie Hawkins     
##  5 Write-ins         
##  6 Gloria La Riva    
##  7 Brock Pierce      
##  8 Rocky De La Fuente
##  9 Don Blankenship   
## 10 Kanye West        
## # … with 28 more rows

Of the 38 distinct values in candidate, 36 correspond to named presidential candidates in the 2020 election; the remainder are catch-all categories rather than people.
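The gap between 38 distinct values and 36 named candidates comes from those catch-all levels. A toy sketch of filtering them out (the exact catch-all labels in election.raw are an assumption here, not taken from the data):

```r
# hypothetical: treat catch-all ballot lines as non-candidates
cands = c("Joe Biden", "Donald Trump", "Write-ins", "None of these candidates")
named = setdiff(cands, c("Write-ins", "None of these candidates"))
length(named)   # 2 named candidates in this toy vector
```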

Draw a bar chart of all votes received by each candidate.

candid = election.raw %>% group_by(candidate) %>% summarise(votes = sum(votes))
cand1=unlist(candid[1],use.names=FALSE)
cand_num1= unlist(candid[2],use.names=FALSE)

#plot showing how many people voted for Joe Biden and Donald Trump, relative to other candidates
barplot(cand_num1, main='Total Votes per Candidate', xlab= 'candidates', ylab= 'total votes ', col='#69b3a2', names.arg=cand1)

This first chart shows the total votes per candidate in the 2020 presidential election; it makes apparent how dominant Joe Biden and Donald Trump were relative to the other candidates.

#bar charts comparing all candidates
par(mfrow=c(1,3))
barplot(log(cand_num1[c(1:4)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(1:4)], col='#69b3a2', xlim = c(0, 20))
barplot(log(cand_num1[c(5:7)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(5:7)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(8:11)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(8:11)], col='#69b3a2',xlim = c(0, 20))

par(mfrow=c(1,3))
barplot(log(cand_num1[c(12:15)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(12:15)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(16:19)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(16:19)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(20:22)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(20:22)], col='#69b3a2',xlim = c(0, 20))

par(mfrow=c(1,3))
barplot(log(cand_num1[c(24:27)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(24:27)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(28:30)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(28:30)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(31:33)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(31:33)], col='#69b3a2',xlim = c(0, 20))

par(mfrow=c(1,2))
barplot(log(cand_num1[c(34,36)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(34,36)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(35, 37)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(35, 37)], col='#69b3a2',xlim = c(0, 20))

barplot(log(cand_num1[c(23,38)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(23, 38)], col='#69b3a2',xlim = c(0, 20))

These charts compare the total number of votes across candidates. To make the bars easier to read, all charts are on a log scale.

Problem 5

Create data sets county.winner and state.winner by taking the candidates with the highest proportion of votes at both the county and state level.

subs = election.raw %>% group_by(state, county) %>% mutate(total = sum(votes), pct = votes/sum(votes))

county.winner= subs %>% arrange(county, desc(pct))
county.winner= top_n(county.winner, 1)
## Selecting by pct
county.winner
## # A tibble: 4,481 x 7
## # Groups:   state, county [4,478]
##    state          county        candidate    party  votes  total   pct
##    <chr>          <chr>         <fct>        <fct>  <dbl>  <dbl> <dbl>
##  1 South Carolina Abbeville     Donald Trump REP     9343  14552 0.642
##  2 Maine          Abbot         Donald Trump REP      288    417 0.691
##  3 Massachusetts  Abington      Joe Biden    DEM     5127   9550 0.537
##  4 Louisiana      Acadia Parish Donald Trump REP    22596  28425 0.795
##  5 Virginia       Accomack      Donald Trump REP     9172  16938 0.542
##  6 Massachusetts  Acton         Joe Biden    DEM    10793  13533 0.798
##  7 Maine          Acton         Donald Trump REP      930   1636 0.568
##  8 Massachusetts  Acushnet      Donald Trump REP     3193   5841 0.547
##  9 New Hampshire  Acworth       Joe Biden    DEM      293    569 0.515
## 10 Idaho          Ada           Donald Trump REP   130699 259389 0.504
## # … with 4,471 more rows
state.winner = election.raw %>% group_by(state, candidate) %>% summarise(votes = sum(votes)) %>% mutate(pct = votes/sum(votes))
state.winner= top_n(state.winner,1)
## Selecting by pct
state.winner
## # A tibble: 51 x 4
## # Groups:   state [51]
##    state                candidate      votes   pct
##    <chr>                <fct>          <dbl> <dbl>
##  1 Alabama              Donald Trump 1434159 0.621
##  2 Alaska               Donald Trump   80999 0.614
##  3 Arizona              Joe Biden    1643664 0.495
##  4 Arkansas             Donald Trump  761251 0.626
##  5 California           Joe Biden    9315259 0.646
##  6 Colorado             Joe Biden    1753416 0.553
##  7 Connecticut          Joe Biden    1059252 0.593
##  8 Delaware             Joe Biden     295413 0.588
##  9 District of Columbia Joe Biden     258561 0.926
## 10 Florida              Donald Trump 5667834 0.512
## # … with 41 more rows
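top_n(x, 1) keeps the highest-pct row within each group (ties included, which is why county.winner has 4,481 rows for 4,478 groups). The same per-group argmax can be sketched in base R on toy data; note that which.max, unlike top_n, silently drops ties:

```r
# toy stand-in for the election data: pick the highest-share row per state
toy = data.frame(
  state = c("A", "A", "B", "B"),
  candidate = c("X", "Y", "X", "Y"),
  votes = c(10, 30, 25, 5)
)
toy$pct = ave(toy$votes, toy$state, FUN = function(v) v / sum(v))
winner = do.call(rbind, lapply(split(toy, toy$state),
                               function(g) g[which.max(g$pct), ]))
winner$candidate   # "Y" wins state A, "X" wins state B
```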

Problem 6

Draw county-level map by creating counties= map_data("county"). Color by county.

county <- map_data("county")

ggplot(data = county) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group),
               color = "white") + 
  coord_fixed(1.3) +
  guides(fill = "none")

Problem 7

Now color the map by the winning candidate for each state.

#make the state column names match between the two data sets
colnames(state.winner)= c("region", "candidate", "votes", "pct")
state.winner$region= tolower(state.winner$region)
states <- map_data("state")
#joining dataframes
state_join= left_join(states, state.winner, by= 'region')

#map code

ggplot(data = state_join) + ggtitle("Presidential Election Results in the United States") +
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
               color = "white") + 
  coord_fixed(1.3) + theme(legend.title = element_blank())

Problem 8

Color the map of the state of California by the winning candidate for each county. Note that some counties have not finished counting the votes, and thus do not have a winner. Leave these counties uncolored.

cali= county %>% filter(region== 'california')

cali.winner=county.winner %>% filter(state=="California")
cali.winner$subregion= tolower(cali.winner$county)
cali_join= left_join(cali, cali.winner, by='subregion')

#map code
ggplot(data = cali_join) + ggtitle("Presidential Election Results in California") +
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
               color = "white") + 
  coord_fixed(1.3) 

Problem 9

Create a visualization of your choice by using census data.

##california census (poverty rates in different california counties)
cali_cen= census %>% select(State, County, TotalPop, Income, Poverty, ChildPoverty) %>% group_by(State, County) %>% filter(State=='California')
cali_cen= cali_cen[-c(1)]

#organizing dataframe 'cen_demo'
colnames(cali_cen)= c("subregion", "TotalPop", "Income", "Poverty", "ChildPoverty")
cali_cen$subregion= tolower(cali_cen$subregion)
cali_cen$subregion= gsub("\\s*\\w*$", "", cali_cen$subregion)

#joining dataframes
cen_join= left_join(cali, cali_cen, by='subregion')

#mapping

ggplot(data = cali_join) + ggtitle("Presidential Election Results in California") +
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
               color = "white") + 
  coord_fixed(1.3)

ggplot(data = cen_join) + ggtitle("Poverty in California") +
  geom_polygon(aes(x = long, y = lat, fill = Poverty, group = group),
               color = "white") + 
  coord_fixed(1.3)

Using the census data set, I looked for a relationship between a California county's poverty rate and whether it leaned Republican or Democratic. Broadly, counties with lower poverty rates appear more likely to vote Democratic, while counties with higher poverty rates lean more Republican.

Problem 10

census.clean = na.omit(census) %>%
  mutate(Men =100*(Men/TotalPop), Employed = 100*(Employed/TotalPop), VotingAgeCitizen= 100*(VotingAgeCitizen/TotalPop))
census.clean$Minority = census.clean$Hispanic+census.clean$Black+census.clean$Asian+census.clean$Pacific+census.clean$Native
drops = c("IncomeErr", "IncomePerCap", "IncomePerCapErr", "Walk", "PublicWork", "Construction","Hispanic","Black","Asian","Native","Pacific", "Women","ChildPoverty","Mino")
census.clean = census.clean[ , !(names(census.clean) %in% drops)]
head(census.clean,n=5)
## # A tibble: 5 x 25
##   CountyId State County TotalPop   Men White VotingAgeCitizen Income Poverty
##      <dbl> <chr> <chr>     <dbl> <dbl> <dbl>            <dbl>  <dbl>   <dbl>
## 1     1001 Alab… Autau…    55036  48.9  75.4             74.5  55317    13.7
## 2     1003 Alab… Baldw…   203360  48.9  83.1             76.4  52562    11.8
## 3     1005 Alab… Barbo…    26201  53.3  45.7             77.4  33368    27.2
## 4     1007 Alab… Bibb …    22580  54.3  74.6             78.2  43404    15.2
## 5     1009 Alab… Bloun…    57667  49.4  87.4             73.7  47412    15.6
## # … with 16 more variables: Professional <dbl>, Service <dbl>, Office <dbl>,
## #   Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## #   OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## #   PrivateWork <dbl>, SelfEmployed <dbl>, FamilyWork <dbl>,
## #   Unemployment <dbl>, Minority <dbl>
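The feature drops discussed below hinge on pairwise correlations between candidate variables. A toy version of that check (synthetic numbers, not the real census columns):

```r
# synthetic columns standing in for Poverty / ChildPoverty / Income
set.seed(2)
pov = runif(100, 5, 40)
child_pov = 1.1 * pov + rnorm(100, sd = 1)        # nearly collinear by construction
inc = 80000 - 1000 * pov + rnorm(100, sd = 8000)  # related, but more loosely
cor(pov, child_pov)   # close to 1: drop one of the pair
cor(pov, inc)         # negative but weaker: keep both
```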

I also decided to drop the variable Women, since it can be recovered from Men and TotalPop and therefore adds nothing for prediction. For the remaining features, I computed the correlation between Poverty and ChildPoverty, whose observations follow very similar patterns; with a correlation of .9328, I decided to drop ChildPoverty. Intuitively this makes sense: most children depend on their parents' wealth or lack thereof, so the two variables are highly correlated and almost perfectly collinear outside of special cases. Similarly, Minority and White are almost perfectly negatively correlated, at -.9973, so I decided to drop the Minority variable as well. I also looked at the correlation coefficient between Poverty and Income, expecting a strong negative relationship; at -.7646, however, I did not feel the case was strong enough to drop the variable.

Problem 11

pr.out = prcomp(census.clean[,-c(1:3)], scale = TRUE, center = TRUE)
pc.county = data.frame(pr.out$rotation[,1:2])
arrange(pc.county,PC1)[,1:2]
##                          PC1         PC2
## Poverty          -0.39519558 -0.11026775
## Unemployment     -0.36333539 -0.08392321
## Minority         -0.32163531 -0.20162620
## Service          -0.22030132 -0.14424445
## Drive            -0.12406425  0.41959144
## MeanCommute      -0.09360250  0.16346541
## Production       -0.09286140  0.29211482
## Office           -0.08353357  0.15856107
## Carpool          -0.06970064 -0.06612375
## OtherTransp      -0.02425523 -0.22867662
## Men               0.02016044 -0.13401583
## TotalPop          0.02665276 -0.02773789
## VotingAgeCitizen  0.02848391  0.07441859
## Transit           0.04320677 -0.12909339
## PrivateWork       0.05534238  0.43346387
## FamilyWork        0.08671102 -0.20742929
## SelfEmployed      0.15606080 -0.30725057
## WorkAtHome        0.24107653 -0.32313951
## Professional      0.25540023 -0.17744244
## White             0.31803621  0.20680452
## Income            0.33349960  0.03586265
## Employed          0.37440250  0.05665303

Here, I dropped CountyId in addition to State and County: although CountyId is numeric, it is just a predetermined label for each county. I centered the features, as is required before taking the PCA, and also scaled them, since the variables are not all on the same scale. The features with the highest absolute loading values are Poverty, Employed, and Unemployment, from highest to lowest.

Looking at the PC1 loading values, the features Poverty, Unemployment, Minority, Service, Drive, MeanCommute, Production, Office, Carpool, and OtherTransp have negative loadings, while Men, TotalPop, VotingAgeCitizen, Transit, PrivateWork, FamilyWork, SelfEmployed, WorkAtHome, Professional, White, Income, and Employed have positive loadings. Opposite signs imply a negative association between the two groups along this component, which makes sense: Employed and Unemployment should move in opposite directions, as should Income and Poverty.

Problem 12

pr.var = pr.out$sdev^2
plot(pr.var, xlab = "Principal Component", ylab = "Variance of Principal Component", ylim = c(0,200))

pve = pr.var/sum(pr.var)
plot(cumsum(pve), xlab="Principal Component ",
ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
abline( h= .9, col = "red")

The number of principal components needed to capture 90% of the variance is 12.
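The threshold count can be read off programmatically instead of from the plot. A self-contained sketch on random data (not the census matrix):

```r
set.seed(1)
X = matrix(rnorm(200), nrow = 20)    # toy 20 x 10 data matrix
pr = prcomp(X, scale. = TRUE, center = TRUE)
pve = pr$sdev^2 / sum(pr$sdev^2)
k = which(cumsum(pve) >= 0.9)[1]     # first PC index reaching 90% cumulative PVE
```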

Problem 13

set.seed(1)
census.dist = dist(census.clean[,-c(1:3)])
census.hclust = hclust(census.dist)
clus = cutree(census.hclust,10)
## dendrogram: branches colored by 10 groups
dend1 = as.dendrogram(census.hclust)
# color branches and labels by 10 clusters
dend1 = color_branches(dend1, k=10)
dend1 = color_labels(dend1, k=10)
# change label size
dend1 = set(dend1, "labels_cex", .5)
dend1 = set_labels(dend1, labels=census.clean$County[order.dendrogram(dend1)])
plot(dend1, horiz = T, main = "Dendrogram of Counties Based on Features(10 Clusters)")

plot(dend1[[2]][[2]][[2]][[2]][[2]][[2]][[2]][[1]][[1]], main = "Santa Barbara County Cluster Based on Features",horiz = T)

pc.score = data.frame(pr.out$x[,1:2])
pc.dist = dist(pc.score)
pc.hclust = hclust(pc.dist)
clus2 = cutree(pc.hclust, 10)
## dendrogram: branches colored by 10 groups
dend2 = as.dendrogram(pc.hclust)
# color branches and labels by 10 clusters
dend2 = color_branches(dend2, k = 10)
dend2 = color_labels(dend2, k = 10)
# change label size
dend2 = set(dend2, "labels_cex", .4)
dend2 = set_labels(dend2, labels=census.clean$County[order.dendrogram(dend2)])
plot(dend2, horiz = T)
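Rather than tracing dendrogram endpoints by eye, cutree reports the cluster of a labelled observation directly. A toy sketch with made-up county labels:

```r
set.seed(1)
pts = matrix(rnorm(30), ncol = 2)      # 15 toy observations
rownames(pts) = paste0("county", 1:15)
hc = hclust(dist(pts))
grp = cutree(hc, k = 3)
grp["county7"]                          # cluster id of one observation
names(grp)[grp == grp["county7"]]       # every member of that cluster
```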

Searching the endpoints of the first dendrogram, I identified Santa Barbara County in the pink cluster. Because this is the largest cluster, it tells us little about Santa Barbara County beyond the fact that it sits a small distance from many counties in feature space. In the dendrogram built from the PC1 and PC2 scores, the clusters are more evenly split, so, in contrast to the first dendrogram, Santa Barbara falls into a more specific cluster.

Classification

# we move all state and county names into lower-case
tmpwinner <- county.winner %>% ungroup %>%
  mutate_at(vars(state, county), tolower)

# we move all state and county names into lower-case
# we further remove suffixes of "county" and "parish"
tmpcensus <- census.clean %>% mutate_at(vars(State, County), tolower) %>%
  mutate(County = gsub(" county| parish", "", County)) 

# we join the two datasets
election.cl <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit
# drop levels of county winners if you haven't done so in previous parts
election.cl$candidate <- droplevels(election.cl$candidate)

## save meta information
election.meta <- election.cl %>% select(c(county, party, CountyId, state, votes, pct, total))

## save predictors and class labels
election.cl = election.cl %>% select(-c(county, party, CountyId, state, votes, pct, total))

Problem 14

We excluded party as a predictor variable, as it is not a numeric vector.

set.seed(12) 
n <- nrow(election.cl)
idx.tr <- sample.int(n, 0.8*n) 
election.tr <- election.cl[idx.tr, ]
election.te <- election.cl[-idx.tr, ]
set.seed(20) 
nfold <- 10
folds <- sample(cut(1:nrow(election.tr), breaks=nfold, labels=FALSE))
calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=5, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","rf","boosted")
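calc_error_rate above is just the misclassification fraction. A quick sanity check on toy labels (the function is repeated here so the snippet stands alone):

```r
# same definition as above, restated for a self-contained check
calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}
calc_error_rate(c("a", "b", "b"), c("a", "a", "b"))   # one of three wrong: 1/3
```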

Problem 15

set.seed(123)
election.tree = tree(candidate ~.,data=election.tr)
draw.tree(election.tree, nodeinfo = TRUE, cex = .5)

cv= cv.tree(election.tree, FUN=prune.misclass, rand = folds)
best_size=  min(cv$size[cv$dev==min(cv$dev)])
pt.cv= prune.misclass(election.tree, best= best_size)
draw.tree(pt.cv, nodeinfo=TRUE, cex=0.5)

pred.pt.cv.te= predict(pt.cv, election.te, type="class")
pred.pt.cv.tr = predict(pt.cv, election.tr, type = "class")
records[1,2]=calc_error_rate(pred.pt.cv.te, election.te$candidate)
records[1,1]=calc_error_rate(pred.pt.cv.tr, election.tr$candidate)
records
##          train.error test.error
## tree      0.08100446  0.0987055
## logistic          NA         NA
## lasso             NA         NA
## rf                NA         NA
## boosted           NA         NA

The pruned tree has a slightly lower classification rate, as it uses fewer variables in its decision making. From the pruned tree, we see that in counties where less than 1.15% of residents commute by Transit, Donald Trump wins 83.6% of the time. Moving down, among those counties where the White share of the population exceeds 48.95%, there is a 91.7% chance the county voted for Trump; among the rest, an Unemployment rate above 6.75% gives Joe Biden around 55.6% of the counties. On the right side of the tree, of the 379 counties with a Transit percentage above 1.15, around 60.9% voted for Biden when TotalPop exceeded 131,021. Of the remaining 39.1% (TotalPop below 131,021), counties with less than 18.95% of the population working in Service voted for Trump about 83.3% of the time, and further down, of those 84 counties, the ones with less than 45.15% working in a Professional field also voted for Trump about 83.3% of the time. Splitting on TotalPop once more, counties with a White share above 80.5% had only around a 16.5% chance of voting for Donald Trump, with the remainder voting for Joe Biden.

election.tr
## # A tibble: 2,469 x 23
##    candidate TotalPop   Men White VotingAgeCitizen Income Poverty Professional
##    <fct>        <dbl> <dbl> <dbl>            <dbl>  <dbl>   <dbl>        <dbl>
##  1 Joe Biden    68364  48.3  71.4             73.9  59684    13.3         41.8
##  2 Donald T…   134327  48.9  78.3             74.9  46213    17.4         29  
##  3 Donald T…    10000  53.1  79.1             75.9  31933    26.2         23.8
##  4 Donald T…   310186  48.4  77.1             77.6  46475    17           27.6
##  5 Donald T…    34667  49.6  64.5             73.3  55793    13.2         32.7
##  6 Donald T…    10893  49.7  87.6             74.9  36893    17.3         26.9
##  7 Donald T…    68520  48.9  93.3             79.2  38266    17.2         27.7
##  8 Donald T…    37763  51.0  92.9             78.2  44837    15.6         27.1
##  9 Donald T…    17711  50.3  96.2             77.3  51114    10.4         27.6
## 10 Donald T…     8700  51.7  95.8             74.1  49135    15.7         24.1
## # … with 2,459 more rows, and 15 more variables: Service <dbl>, Office <dbl>,
## #   Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## #   OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## #   PrivateWork <dbl>, SelfEmployed <dbl>, FamilyWork <dbl>,
## #   Unemployment <dbl>, Minority <dbl>
sum(election.tr$Transit<1.15 & election.tr$White <48.95 & election.tr$Unemployment> 6.75 )
## [1] 135

Problem 16

glm_fit= glm(candidate ~ .,
             data= election.tr, family= "binomial")
summary(glm_fit)
## 
## Call:
## glm(formula = candidate ~ ., family = "binomial", data = election.tr)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.10668  -0.24555  -0.10004  -0.03369   3.05940  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.738e+01  9.351e+00  -2.928 0.003416 ** 
## TotalPop          1.451e-06  6.050e-07   2.398 0.016485 *  
## Men               4.255e-02  4.336e-02   0.981 0.326466    
## White            -1.429e-01  6.711e-02  -2.130 0.033207 *  
## VotingAgeCitizen  1.473e-01  2.573e-02   5.723 1.05e-08 ***
## Income           -1.634e-05  1.691e-05  -0.966 0.333915    
## Poverty           2.824e-02  2.957e-02   0.955 0.339635    
## Professional      3.189e-01  4.066e-02   7.844 4.38e-15 ***
## Service           3.623e-01  5.089e-02   7.119 1.09e-12 ***
## Office            1.519e-01  5.035e-02   3.017 0.002550 ** 
## Production        1.953e-01  4.248e-02   4.598 4.26e-06 ***
## Drive            -2.063e-01  4.826e-02  -4.274 1.92e-05 ***
## Carpool          -2.113e-01  6.191e-02  -3.413 0.000642 ***
## Transit          -6.995e-02  1.026e-01  -0.682 0.495349    
## OtherTransp      -8.186e-03  9.979e-02  -0.082 0.934622    
## WorkAtHome       -8.245e-02  7.366e-02  -1.119 0.262980    
## MeanCommute       5.083e-02  2.371e-02   2.144 0.032017 *  
## Employed          2.502e-01  3.231e-02   7.745 9.59e-15 ***
## PrivateWork       6.757e-02  2.165e-02   3.121 0.001801 ** 
## SelfEmployed     -4.823e-02  4.711e-02  -1.024 0.305939    
## FamilyWork       -3.323e-01  3.961e-01  -0.839 0.401486    
## Unemployment      2.277e-01  4.595e-02   4.955 7.24e-07 ***
## Minority         -1.530e-02  6.560e-02  -0.233 0.815526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2203.82  on 2468  degrees of freedom
## Residual deviance:  816.15  on 2446  degrees of freedom
## AIC: 862.15
## 
## Number of Fisher Scoring iterations: 7
exp(coef(glm_fit))
##      (Intercept)         TotalPop              Men            White 
##     1.288993e-12     1.000001e+00     1.043467e+00     8.668349e-01 
## VotingAgeCitizen           Income          Poverty     Professional 
##     1.158650e+00     9.999837e-01     1.028639e+00     1.375603e+00 
##          Service           Office       Production            Drive 
##     1.436622e+00     1.164084e+00     1.215721e+00     8.136014e-01 
##          Carpool          Transit      OtherTransp       WorkAtHome 
##     8.095099e-01     9.324388e-01     9.918479e-01     9.208533e-01 
##      MeanCommute         Employed      PrivateWork     SelfEmployed 
##     1.052142e+00     1.284294e+00     1.069906e+00     9.529106e-01 
##       FamilyWork     Unemployment         Minority 
##     7.172660e-01     1.255700e+00     9.848123e-01
pred.log.te= predict(glm_fit, newdata = election.te, type="response")
pred.log.labeled.te = ifelse(pred.log.te < .5, "Donald Trump","Joe Biden")
pred.log.tr = predict(glm_fit, election.tr, type = "response")
pred.log.labeled.tr = ifelse(pred.log.tr < .5, "Donald Trump","Joe Biden")
records[2,2]=calc_error_rate(pred.log.labeled.te, election.te$candidate)
records[2,1]=calc_error_rate(pred.log.labeled.tr, election.tr$candidate)
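Exponentiating a logit coefficient turns it into a multiplicative change in odds; the arithmetic for the White coefficient (value taken from the summary above, about -1.429e-01):

```r
# exp() of a logit coefficient gives an odds ratio per one-unit increase
or = exp(-0.1429)
or                # ~0.867
(1 - or) * 100    # ~13% lower odds of the Biden outcome per one-point increase in White
```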

The significant variables in this model are TotalPop, White, VotingAgeCitizen, Professional, Service, Office, Production, Drive, Carpool, MeanCommute, Employed, PrivateWork, and Unemployment; there are many more significant variables here than in our pruned decision tree. To interpret the coefficient for the White variable, I exponentiate it to get .8668, which I read as a decrease of around 13% in the odds of a Biden win for each percentage-point increase in White. For Unemployment, the odds of a Biden win rise by around 26% for each percentage-point increase in unemployment.

Problem 17

set.seed(5)
lambda.lasso = seq(1, 50) * 1e-4

x.train = as.matrix(election.tr[,-1])

y.train = election.tr$candidate
x.test = as.matrix(election.te[,-1])
lasso.mod <- cv.glmnet(x.train, y.train, alpha=1, lambda = lambda.lasso, family = "binomial")
plot(lasso.mod)
abline(v = log(lasso.mod$lambda.min), col="red", lwd=3, lty=2)

bestlam =lasso.mod$lambda.min
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[2:22,]
bestlam
## [1] 0.0012
lasso.coef
##         TotalPop              Men            White VotingAgeCitizen 
##     1.569020e-06     0.000000e+00    -1.162029e-01     1.496512e-01 
##           Income          Poverty     Professional          Service 
##     0.000000e+00     4.079363e-02     2.390442e-01     2.798460e-01 
##           Office       Production            Drive          Carpool 
##     8.144856e-02     1.205952e-01    -1.357798e-01    -1.294359e-01 
##          Transit      OtherTransp       WorkAtHome      MeanCommute 
##     0.000000e+00     3.886652e-02    -2.096341e-03     1.952859e-02 
##         Employed      PrivateWork     SelfEmployed       FamilyWork 
##     2.145291e-01     5.793665e-02    -6.509845e-02    -2.725748e-01 
##     Unemployment 
##     1.990332e-01
lasso.pred.train = predict(lasso.mod, s = bestlam, newx = x.train,type = "response")
lasso.pred.train.labeled = ifelse(lasso.pred.train < .5, "Donald Trump","Joe Biden")
lasso.pred.test = predict(lasso.mod, s = bestlam, newx = x.test,type = "response")
lasso.pred.test.labeled = ifelse(lasso.pred.test < .5, "Donald Trump","Joe Biden")
records[3,2]=calc_error_rate(lasso.pred.test.labeled, election.te$candidate)
records[3,1]=calc_error_rate(lasso.pred.train.labeled,election.tr$candidate)

The best lambda value in this case is .0012. Compared with the logistic model, the lasso model has a slightly lower training error and a much lower test error. The coefficients that remain non-zero at this lambda are TotalPop, White, VotingAgeCitizen, Poverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, and Unemployment.
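The coefficients the lasso zeroes out (Men, Income, Transit above) come from its soft-thresholding behaviour. A minimal sketch of the operator (illustrative inputs, not glmnet's actual coordinate descent):

```r
# soft-thresholding: the coordinate-wise shrinkage behind the lasso's exact zeros
soft = function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft(c(-0.5, 0.0008, 0.3), 0.0012)   # the small middle value is shrunk exactly to zero
```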

Problem 18

# Recode the test response as 0 = Trump, 1 = Biden for ROCR
election_cand = ifelse(election.te$candidate == "Donald Trump", 0, 1)
pred.pt.cv.te.prob = predict(pt.cv, election.te)
tree.pred = prediction(pred.pt.cv.te.prob[,2], election_cand)
perf.tree = performance(tree.pred, measure = "tpr", x.measure = "fpr")
log.pred = prediction(pred.log.te, election_cand)
perf.log =performance(log.pred, measure = "tpr", x.measure = "fpr")
lasso.pred = prediction(lasso.pred.test, election_cand)
perf.lasso =performance(lasso.pred, measure = "tpr", x.measure = "fpr")
plot(perf.tree, col = 2, lwd =2,main = "ROC Curves")
abline(0,1)
par(new=TRUE)
plot(perf.log, col = 3, lwd = 2)
par(new = TRUE)
plot(perf.lasso, col = 4, lwd = 2)
legend("bottomright",legend = c("Pruned Decision Tree","Logistic Model","Lasso Model"), col = c(2:4),lty =1)
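Each point on these ROC curves is a (FPR, TPR) pair obtained by thresholding the predicted probabilities; ROCR sweeps the threshold for us. A hand-rolled sketch of one such point on toy data (roc_point is our own illustrative helper, assuming the 0 = Trump, 1 = Biden coding used above):

```r
# Compute TPR and FPR at a single probability threshold by hand.
roc_point <- function(prob, label, threshold) {
  pred <- as.integer(prob >= threshold)
  tpr <- sum(pred == 1 & label == 1) / sum(label == 1)  # sensitivity
  fpr <- sum(pred == 1 & label == 0) / sum(label == 0)  # 1 - specificity
  c(tpr = tpr, fpr = fpr)
}

# toy scores: two positives, two negatives, perfectly separated
roc_point(prob = c(0.9, 0.6, 0.4, 0.2), label = c(1, 1, 0, 0), threshold = 0.5)
# tpr = 1, fpr = 0 for this toy example
```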

Problem 19

#RF model 
set.seed(5)
rf.election.train = randomForest(candidate ~ ., data = election.tr, importance = TRUE, ntree = 40)
rf.pred.te= predict(rf.election.train, newdata = election.te,type = "prob")
rf.pred.tr = predict(rf.election.train, newdata = election.tr,type = "prob")
election.rf.labeled.te = ifelse(rf.pred.te[,1]>.5, "Donald Trump","Joe Biden")
election.rf.labeled.tr = as.factor(ifelse(rf.pred.tr[,1]>.5, "Donald Trump","Joe Biden"))
records[4,2]=calc_error_rate(election.rf.labeled.te, election.te$candidate)
records[4,1]=calc_error_rate(election.rf.labeled.tr,election.tr$candidate)
rf.pred = prediction(rf.pred.te[,2], election.te$candidate)
perf.rf = performance(rf.pred, measure = "tpr", x.measure = "fpr")
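The records entries above rely on calc_error_rate, which was defined earlier in the assignment. Consistent with how it is used here, it should amount to the fraction of mismatched labels; a minimal sketch of that idea (calc_error_rate_sketch is our assumption, not the original definition):

```r
# Misclassification rate: proportion of predicted labels that disagree
# with the true labels (an assumed stand-in for calc_error_rate).
calc_error_rate_sketch <- function(predicted, true) {
  mean(predicted != true)
}

calc_error_rate_sketch(c("Joe Biden", "Donald Trump", "Joe Biden"),
                       c("Joe Biden", "Joe Biden", "Joe Biden"))
# one of three labels wrong -> 1/3
```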

Problem 20

#boosted model
set.seed(4)
election.boost.training = gbm(ifelse(candidate=="Donald Trump", 0, 1)~. ,data = election.tr, n.trees = 1000, shrinkage = .1, distribution = "bernoulli")
summary(election.boost.training)

##                               var    rel.inf
## White                       White 24.0170037
## Transit                   Transit 22.5487385
## TotalPop                 TotalPop 12.5190736
## Professional         Professional  7.5282181
## Employed                 Employed  5.0302233
## VotingAgeCitizen VotingAgeCitizen  3.5396303
## Minority                 Minority  3.5165478
## Income                     Income  3.4369720
## Service                   Service  3.3126322
## Unemployment         Unemployment  2.4710346
## Men                           Men  2.3080618
## Poverty                   Poverty  2.0548538
## Production             Production  2.0252723
## OtherTransp           OtherTransp  0.9241050
## SelfEmployed         SelfEmployed  0.9122087
## PrivateWork           PrivateWork  0.8934608
## Drive                       Drive  0.8247592
## Office                     Office  0.6817054
## WorkAtHome             WorkAtHome  0.5504330
## Carpool                   Carpool  0.4870107
## MeanCommute           MeanCommute  0.3613464
## FamilyWork             FamilyWork  0.0567090
election.boost.tr = predict(election.boost.training, newdata = election.tr, type = "response")
## Using 1000 trees...
election.boost.te = predict(election.boost.training, newdata = election.te, type = "response")
## Using 1000 trees...
election.boost.tr.labeled = ifelse(election.boost.tr < .5, "Donald Trump", "Joe Biden")
election.boost.te.labeled = ifelse(election.boost.te < .5, "Donald Trump", "Joe Biden")
records[5,1]=calc_error_rate(election.boost.tr.labeled,election.tr$candidate)
records[5,2]=calc_error_rate(election.boost.te.labeled,election.te$candidate)
boost.pred = prediction(election.boost.te, election.te$candidate)
perf.boost = performance(boost.pred, measure = "tpr", x.measure = "fpr")
#All ROC Curves, AUC values and Records
plot(perf.tree, col = 2, lwd =2,main = "ROC Curves")
abline(0,1)
par(new=TRUE)
plot(perf.log, col = 3, lwd = 2)
par(new = TRUE)
plot(perf.lasso, col = 4, lwd = 2)
par(new = TRUE)
plot(perf.rf, col = 5, lwd = 2)
par(new = TRUE)
plot(perf.boost, col = 6, lwd = 2)
legend("bottomright",legend = c("Pruned Decision Tree","Logistic Model","Lasso Model","RF Model","Boosted Model"), col = c(2:6),lty =1, cex = .5)

auc.tree = performance(tree.pred, "auc")@y.values
auc.log = performance(log.pred, "auc")@y.values
auc.lasso = performance(lasso.pred, "auc")@y.values
auc.rf = performance(rf.pred, "auc")@y.values
auc.boost = performance(boost.pred, "auc")@y.values
cbind(c("Pruned Decision Tree","Logistic Model","Lasso Model","RF Model","Boosted Model"), c(auc.tree,auc.log,auc.lasso,auc.rf,auc.boost))
##      [,1]                   [,2]     
## [1,] "Pruned Decision Tree" 0.8570153
## [2,] "Logistic Model"       0.9556515
## [3,] "Lasso Model"          0.958281 
## [4,] "RF Model"             0.950157 
## [5,] "Boosted Model"        0.9560243
records
##           train.error test.error
## tree     0.0810044552 0.09870550
## logistic 0.0619684083 0.07928803
## lasso    0.0627784528 0.07119741
## rf       0.0004050223 0.06472492
## boosted  0.0287565816 0.06472492

Looking at the AUC values for these curves, the lasso model has the highest AUC. The random forest and boosted models achieve lower test errors than the pruned tree, logistic, and lasso models, so in terms of raw accuracy they look better; however, the near-zero random forest training error suggests some overfitting. In terms of the trade-off between true positive and false positive rates across thresholds, the lasso model does best.
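The AUC values above have a direct interpretation: the probability that a randomly chosen Biden county receives a higher predicted score than a randomly chosen Trump county. A base-R sketch of that rank-statistic (Mann-Whitney) equivalence on toy scores (auc_rank is our own helper, not from ROCR):

```r
# AUC equals the Wilcoxon/Mann-Whitney statistic: the proportion of
# (positive, negative) pairs where the positive case scores higher,
# counting ties as half.
auc_rank <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  pairs <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(pairs)
}

auc_rank(score = c(0.9, 0.7, 0.4, 0.3, 0.2), label = c(1, 1, 1, 0, 0))
# all positive scores exceed all negative scores -> AUC = 1
```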

#calculate swing counties
swing=county.winner %>% filter(pct <= 0.525 & pct >= 0.475)
swing
## # A tibble: 496 x 7
## # Groups:   state, county [493]
##    state         county  candidate    party  votes  total   pct
##    <chr>         <chr>   <fct>        <fct>  <dbl>  <dbl> <dbl>
##  1 New Hampshire Acworth Joe Biden    DEM      293    569 0.515
##  2 Idaho         Ada     Donald Trump REP   130699 259389 0.504
##  3 Vermont       Addison Joe Biden    DEM      445    870 0.511
##  4 Massachusetts Agawam  Donald Trump REP     8021  16069 0.499
##  5 Colorado      Alamosa Donald Trump REP     3676   7519 0.489
##  6 Wyoming       Albany  Joe Biden    DEM     9091  18612 0.488
##  7 Vermont       Alburgh Joe Biden    DEM      550   1079 0.510
##  8 Maine         Alfred  Joe Biden    DEM      983   1995 0.493
##  9 Connecticut   Andover Joe Biden    DEM     1058   2091 0.506
## 10 Minnesota     Anoka   Donald Trump REP   104894 210474 0.498
## # … with 486 more rows
swing$subregion= tolower(swing$county)

county_join= left_join(swing, county, by= 'subregion')

ggplot(data = county_join) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group),
               color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)

swing.counts = table(swing$candidate)
barplot(swing.counts)

This election was very evenly polarized, which made it difficult to predict the future president. We classified a county as a "swing" county if the difference between the votes for Biden and Trump was 5% or less.

Using this definition, there were 496 "swing" counties with no strong inclination towards either candidate: 238 counties leaned towards Donald Trump by 5% or less, while 258 counties leaned slightly towards Joe Biden. Out of the 2825 unique counties, 17.56% were considered "swing" counties in our calculation. Had we instead defined "swing" counties as those with a majority vote between 45% and 55%, roughly 30% of counties would have qualified.

Looking at the map, the swing counties are spread fairly evenly across the country. Because of this, the destination of the electoral votes is uncertain: it is not just one state that is unsure of its preferred candidate.

Ultimately, because of how numerous and how spread out the swing counties are in the data, it is difficult to predict which presidential candidate the counties preferred.

On top of this, due to COVID-19, mail-in ballots were accepted until November 3, 2020. Mail-in ballots could take up to a week to reach polling stations, which adds even more uncertainty to this data.
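The 5% swing definition above amounts to keeping counties whose winning share `pct` falls in [0.475, 0.525], as in the `filter` call earlier. A toy illustration of that same filter in base R (the counties and shares here are made up, not the real county.winner data):

```r
# Hypothetical winner shares for four made-up counties.
toy <- data.frame(
  county = c("A", "B", "C", "D"),
  pct    = c(0.51, 0.62, 0.476, 0.40)
)

# A county is "swing" when the winner's share is within 2.5 points of 50%,
# i.e. the Biden-Trump gap is at most 5%.
swing_toy <- toy[toy$pct >= 0.475 & toy$pct <= 0.525, ]
swing_toy$county
# counties A and C qualify
```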

Problem 21

Based on the multiple models included in our analysis, we saw that the AUC value for the lasso model was the highest of all the models we chose. Interpreting this lasso model, the significant variables with an effect on the voting outcome were TotalPop, White, VotingAgeCitizen, Poverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, and Unemployment. The remaining variables, whose coefficients were shrunk to zero, could have caused overfitting had they been kept in the model. To visualize the errors on the map, I will add the lasso model's predicted test election results to the sampled test election dataset.

records[3,2]
## [1] 0.07119741

We saw a test error of 0.07119, which is a solid value, with a lower training error that makes sense. The significant variables chosen by this model, such as Professional, Poverty, and Unemployment, are plausible, as these were factors widely reported to play a large role in the recent election. However, these insights are not enough on their own, as there were clearly more variables with significance on our test data set for the lasso model. This implies a lack of understanding and a need to expand our domain knowledge, as I was not able to completely explain why the data was influenced by all of these variables.
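To map the errors as described above, one would flag each test county where the lasso prediction disagrees with the true winner. A toy sketch of that flagging step (the counties, labels, and the column name misclassified are all made up for illustration, not taken from election.te):

```r
# Toy example: mark counties where the predicted winner differs from
# the true winner, so they can be highlighted on a map.
toy <- data.frame(
  county    = c("Kent", "New Castle", "Sussex"),
  truth     = c("Joe Biden", "Joe Biden", "Donald Trump"),
  predicted = c("Joe Biden", "Donald Trump", "Donald Trump")
)
toy$misclassified <- toy$truth != toy$predicted
toy$county[toy$misclassified]
# only "New Castle" is misclassified in this toy example
```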